A first look at the data shows that we have 13 columns and 7062 observations. Some of these columns are numeric, some are categorical, which makes it a good example for this project. Let’s start exploring European football strikers!
## 'data.frame': 7062 obs. of 13 variables:
## $ age : int 26 25 22 26 26 26 23 27 21 30 ...
## $ current.club : Factor w/ 193 levels "AC Milan","ACF Fiorentina",..: 193 193 193 193 193 161 161 161 160 161 ...
## $ current.league: Factor w/ 10 levels "Eredivisie","Jupiler Pro League",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ foot : Factor w/ 4 levels "","both","left",..: 4 2 2 4 4 2 4 3 3 4 ...
## $ height : int 183 180 179 183 188 174 175 178 180 183 ...
## $ name : Factor w/ 1173 levels "?der","?douard Duplan",..: 273 318 1013 49 114 919 899 674 1170 690 ...
## $ nationality : Factor w/ 285 levels "Albania","Albania Bulgaria",..: 225 11 11 225 225 196 39 212 225 39 ...
## $ position : Factor w/ 2 levels "CF","W": 2 2 1 1 1 2 2 2 2 1 ...
## $ season : Factor w/ 6 levels "2012-13","2013-14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ assists : int 1 NA NA 5 0 11 NA 5 NA 9 ...
## $ games : int 22 NA NA 26 5 42 NA 40 NA 32 ...
## $ goals : int 3 NA NA 13 1 17 NA 2 NA 15 ...
## $ minutes : int 1057 NA NA 2287 171 3678 NA 3588 NA 2455 ...
## age current.club current.league
## Min. :16.00 Sparta Rotterdam: 66 LaLiga : 828
## 1st Qu.:22.00 AS Monaco : 60 Eredivisie : 792
## Median :25.00 Royal Antwerp FC: 60 Ligue 1 : 780
## Mean :25.44 Akhmat Grozny : 54 Premier League : 756
## 3rd Qu.:28.00 AOK Kerkyra : 54 Jupiler Pro League: 690
## Max. :38.00 Asteras Tripolis: 54 Süper Lig : 690
## (Other) :6714 (Other) :2526
## foot height name nationality
## : 318 Min. :163.0 Wanderson : 18 Spain : 504
## both : 714 1st Qu.:176.0 Leandrinho : 12 Brazil : 372
## left :1350 Median :180.0 William : 12 Italy : 294
## right:4680 Mean :180.6 ?der : 6 Russia : 270
## 3rd Qu.:185.0 ?douard Duplan : 6 Greece : 264
## Max. :204.0 ?mer Ali Sahiner: 6 Netherlands: 228
## NA's :108 (Other) :7002 (Other) :5130
## position season assists games
## CF:3420 2012-13:1177 Min. : 0.000 Min. : 0.00
## W :3642 2013-14:1177 1st Qu.: 1.000 1st Qu.:18.00
## 2014-15:1177 Median : 2.000 Median :27.00
## 2015-16:1177 Mean : 3.367 Mean :26.18
## 2016-17:1177 3rd Qu.: 5.000 3rd Qu.:35.00
## 2017-18:1177 Max. :31.000 Max. :66.00
## NA's :949 NA's :949
## goals minutes
## Min. : 0.000 Min. : 0
## 1st Qu.: 2.000 1st Qu.: 878
## Median : 5.000 Median :1662
## Mean : 6.671 Mean :1694
## 3rd Qu.: 9.000 3rd Qu.:2428
## Max. :61.000 Max. :5060
## NA's :949 NA's :949
Since scoring is the ultimate “goal” of the game :), let’s take a look at the histograms.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 5.000 6.671 9.000 61.000 949
There are 2 interesting points here:
A noticeable share of the strikers do not score at all in a season, hence the peak at 0 goals. (Most do score, though: the first quartile is 2 goals.)
Goals scored has a right-skewed distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 2.000 3.367 5.000 31.000 949
The distribution of assists is similar to that of goals, and the reasoning behind it is more or less the same. However, let’s keep in mind that the number of assists is generally about half the number of goals (mean 3.37 vs. 6.67). This is an expected result, as assists mostly come from midfielders, not strikers.
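As a quick check of the "about half" claim, the ratio can be computed directly from the two means reported in the summaries above. This is only arithmetic on the printed values, not a recomputation from the raw data; the variable names are made up for the illustration.

```r
# Means copied from the summary() outputs above
mean_goals   <- 6.671
mean_assists <- 3.367

# Ratio of average assists to average goals per season
assist_goal_ratio <- mean_assists / mean_goals
round(assist_goal_ratio, 2)
```

The medians (2 assists vs. 5 goals) point in the same direction, with a somewhat lower ratio.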
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 878 1662 1694 2428 5060 949
Note that there is a peak close to 0 minutes. This may be due to several factors: substitutions, injuries, imperfections in data collection, etc.
The distribution of games played is similar to that of minutes, but the peak is a bit to the right. This is because of the obvious correlation between minutes and games played, and the fact that games played is incremented whether a player plays 90 minutes or 1 minute.
Obviously there is more data after the 2013-14 season. This suggests that we should look at the relative frequencies of games played, so that we can normalise the effect of having more data as the seasons get closer to the present day.
Review note: the total number of samples is not equal across seasons. To be able to compare the histograms, I divide the counts by the total number of samples, so that the area under each curve is 1 and the values are comparable across seasons.
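The normalisation described here can be sketched in base R roughly as follows. The two vectors are hypothetical stand-ins for the games played in two seasons with different sample sizes, and the bin width of 10 is an arbitrary choice; the point is only that the `density` values returned by `hist()` divide the counts by both the sample size and the bin width, so each histogram has area 1.

```r
# Hypothetical games-played values for two seasons of different size
games_a <- c(0, 0, 5, 12, 18, 27, 30, 35)                   # 8 observations
games_b <- c(0, 3, 9, 14, 20, 22, 25, 28, 31, 33, 36, 40)   # 12 observations

breaks <- seq(0, 40, by = 10)

# plot = FALSE returns the bin statistics without drawing anything
h_a <- hist(games_a, breaks = breaks, plot = FALSE)
h_b <- hist(games_b, breaks = breaks, plot = FALSE)

# density = count / (n * bin width), so each histogram integrates to 1
area_a <- sum(h_a$density * diff(h_a$breaks))
area_b <- sum(h_b$density * diff(h_b$breaks))
c(area_a, area_b)
```

With equal areas, the bar heights of the two seasons can be compared directly even though one season has more rows.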
Note that for seasons 2012-13 and 2013-14 the number of zero-minute and zero-game records is relatively high. We must keep this in mind while investigating a feature over seasons.
Below I provide some more univariate information to keep in mind in the following parts.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 22.00 25.00 25.44 28.00 38.00
The age histogram of players is as expected. They can become professional at the age of 16 and tend to retire after they are 30 years old.
About 70% of the players with a recorded foot are right-footed, 20% are left-footed and 10% can use both feet.
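The shares can be verified from the counts printed in the summary of the foot column. The sketch below uses only those four counts; the label `unknown` for the empty factor level is an assumption made here for readability.

```r
# Counts copied from the summary() output of the foot column
foot_counts <- c(unknown = 318, both = 714, left = 1350, right = 4680)

# Share of all 7062 observations, in percent
share_all <- round(100 * prop.table(foot_counts), 1)
share_all

# Share among observations with a recorded foot only
known <- foot_counts[names(foot_counts) != "unknown"]
share_known <- round(100 * prop.table(known), 1)
share_known
```

Leaving out the 318 records without a foot value is what brings the split close to 70/20/10.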
I believe the following variables are interesting to look at: goals per minute (goalsperminute), goals per game (goalspergame) and minutes per game.
Let’s add them to the dataset and plot histograms.
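A minimal sketch of how such ratio columns could be added, assuming a data frame with `goals`, `games` and `minutes` columns. The data frame name `strikers` and the column name `minutespergame` are assumptions for the illustration; the two rows are the 4th and 5th observations shown in the `str()` output.

```r
# Two observations taken from the str() output (rows 4 and 5)
strikers <- data.frame(
  goals   = c(13, 1),
  games   = c(26, 5),
  minutes = c(2287, 171)
)

# Ratio variables computed from the existing columns
strikers$goalsperminute <- strikers$goals / strikers$minutes
strikers$goalspergame   <- strikers$goals / strikers$games
strikers$minutespergame <- strikers$minutes / strikers$games  # name is an assumption

strikers
```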
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0017 0.0032 0.0036 0.0050 0.0417 1018
Note that we filter out the goalsperminute = 0 cases, as this histogram is more relevant “if” a goal is scored, and a peak at 0 makes it hard to see the actual characteristics when one or more goals are scored.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0968 0.1905 0.2279 0.3243 2.0000 1016
Note again that we filter out the no-goal cases. It is also important to note that max(goalpergame) = 2.0 does not mean that a player cannot score more than 2 goals in a game. It is the highest season-level ratio of goals scored to games played that any player reached.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 49.40 63.24 60.93 74.93 120.00 1016
This histogram is again as expected. A football game lasts 90 minutes, so 120 minutes is not an outlier: some games go to extra time. In fact we are looking at leagues, not tournaments, so this should not normally occur, but I will keep it as it is, because there may be a play-off or a similar game that takes 120 minutes. Since it is very rare, it will not be critical for our conclusions.
The data includes 7062 observations with 13 variables. 7 of these are factor variables. I added 3 variables computed from other columns.
Factor variables: current.club, current.league, foot, name, nationality, position, season.
Other observations:
There was no need for ordered factors except for seasons (alphabetical ordering was sufficient for all practical purposes).
The 2012-13 and 2013-14 seasons include somewhat more records with 0 minutes and 0 games per player. This may be due to the data collection method and/or because players get more experienced and become more regular starters, hence take more minutes and play in more games.
The number of goals a player scores.
I would like to investigate whether more time on the pitch leads to more goals, whether there is a correlation between the assists and goals a player makes, and how the goal performance of players is affected by other possible factors like league, position, etc.
I expect it to be related to the time a player spends on the pitch, and maybe to his position and the league he plays in. It is also interesting to see the effect of age. It is possible that more experienced players reach higher goal rates.
I created the following variables and added them to the dataframe: goals per minute, goals per game and minutes per game.
The original data was wide, as the goals/assists/minutes/games were given as separate columns for each season. I did a union-like operation (as in SQL) to append these and added a year column (season) to make it easier to plot histograms across all seasons.
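The union-like reshaping can be illustrated in R as below. This is only a sketch of the idea, not the code that was actually used to prepare the data: the wide column names (`goals_2012_13`, `goals_2013_14`), the player labels and the helper `season_block()` are all hypothetical.

```r
# Hypothetical wide layout: one row per player, one goals column per season
wide <- data.frame(
  name          = c("Player A", "Player B"),
  goals_2012_13 = c(3, 13),
  goals_2013_14 = c(5, NA)
)

# Take the column of one season and tag every row with the season label
season_block <- function(df, season, goals_col) {
  data.frame(name = df$name, season = season, goals = df[[goals_col]])
}

# rbind() stacks the per-season blocks, much like UNION ALL in SQL
long <- rbind(
  season_block(wide, "2012-13", "goals_2012_13"),
  season_block(wide, "2013-14", "goals_2013_14")
)
long$season <- factor(long$season)
long
```

In the long layout each player appears once per season, which is what allows a single histogram call to cover all seasons or to be split by `season`.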
It was also not logical to look at goals/minute and goals/game when no goal is scored. This was leading to very high peaks at 0 on the histograms and making it hard to observe the distribution characteristics of these variables. So, while plotting the histograms of the calculated variables, I filtered out the cases where no goal was scored.
I also observed that the number of NaNs increased in the calculated columns (the NA count rises from 949 to 1018 and 1016), as it is possible to have 0 minutes/games and this leads to NaN. My histogram code already filters these with “!is.na(my_variable)”, so it was OK as it is.
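The NaN behaviour is easy to reproduce: in R, 0 / 0 evaluates to NaN, and `is.na()` returns TRUE for both NA and NaN. The small vectors below are made up for the illustration (only the first pair, 13 goals in 2287 minutes, appears in the data shown earlier).

```r
goals   <- c(13, 0, 0, NA)
minutes <- c(2287, 0, 450, NA)

# 0 / 0 yields NaN, and a missing input propagates as NA
goalsperminute <- goals / minutes
is.na(goalsperminute)   # FALSE TRUE FALSE TRUE

# Keep only defined, non-zero ratios (the filter used for the histograms)
keep <- !is.na(goalsperminute) & goalsperminute > 0
goalsperminute[keep]
```

Only the first value survives the filter: the second is NaN, the third is a genuine 0 and the fourth was missing from the start.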
Below is a scatter matrix produced with the ggpairs() function to give us a general look.
Review note: I agree that the values are mixed here, but a correlation plot alone would not be sufficient, because the matrix also let me see box plots of categorical variables like position. I am using this plot as a very rough starting point.
## `geom_smooth()` using method = 'gam'
As expected, as the time a player spends on the pitch increases, the total number of goals he scores increases.
Note also that the relationship looks more exponential, especially after 3250 minutes on the pitch (blue). Although the variance increases in this region, this non-linear increase is not a coincidence. More skilled players play almost every game; that is, they get more game time and they are better at scoring. These are the effects that create the exponential relationship as we go beyond 3000 minutes.
This plot is also interesting. Note that there is an almost horizontal line up to 10 games. So it can be observed that if a player plays more than 10 games in a season, his goal tally increases (very roughly in linear proportion to the number of games he plays). This may be due to the fact that players who play fewer than 10 games per season are usually substitutes and/or less talented players.
Here we observe that as players get more experienced, total goals per season increase, but the data becomes noisier for ages < 20 and > 30 due to the smaller number of samples, i.e. fewer footballers of these ages.
This suggests that there is not much of a relation between the age of a player and the minutes he plays per season. I expected a bit more correlation here, so it was interesting to see that there is not.
This suggests that central forwards (CF) on average score more goals per season than wingers. This is again an expected result. What about the assists they make?
And here it is. Wingers tend to make more assists compared to center forwards. This is again expected, as they play on the sides of the pitch and usually cross in appropriate situations so that CFs can score. CFs, however, tend to go directly for goal, and unless the team is playing with 2 forwards their assist options will be fewer.
Now at first sight… there is not much to see here. But if we focus on outliers, LaLiga has the highest ones. If you are a football fan, you may have already started to think about Lionel Messi and Cristiano Ronaldo. Let’s see who these LaLiga guys are.
## name goals season
## 2452 Cristiano Ronaldo 61 2014-15
## 94 Lionel Messi 60 2012-13
## 3627 Luis Suárez 59 2015-16
## 2448 Lionel Messi 58 2014-15
## 98 Cristiano Ronaldo 55 2012-13
## 4802 Lionel Messi 54 2016-17
## 1275 Cristiano Ronaldo 51 2013-14
## 3629 Cristiano Ronaldo 51 2015-16
And you were right! The only exception is Luis Suárez, who scored 59 goals in the 2015-16 season.
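A listing like the one above boils down to sorting by goals in descending order. The sketch below uses four of the rows printed above, deliberately shuffled; the data frame name `top` is an assumption, and on the full data set a filter on `current.league` would come first.

```r
# Four rows copied from the listing above, in shuffled order
top <- data.frame(
  name   = c("Lionel Messi", "Cristiano Ronaldo", "Luis Suarez", "Lionel Messi"),
  goals  = c(60, 61, 59, 58),
  season = c("2012-13", "2014-15", "2015-16", "2014-15")
)

# order() on the negated goals column sorts from highest to lowest
ranked <- top[order(-top$goals), ]
head(ranked, 3)
```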
It is also interesting to look at the efficiency of goal scoring. We define efficiency as the mean value of goals per minute by league/club/player, i.e. how many goals per minute a player/club/league scores.
It is important to point out that I took only the 2017-18 season for these computations. This is because we only have a “current team” column, while some players change teams every season, so the structure of the data is not appropriate for grouping over seasons in such computations.
Below are some interesting stats.
## # A tibble: 10 x 3
## current.league Mean n
## <fct> <dbl> <int>
## 1 Süper Lig 0.366 115
## 2 Eredivisie 0.323 132
## 3 Serie A 0.319 112
## 4 Premier League 0.311 126
## 5 Jupiler Pro League 0.303 115
## 6 Ligue 1 0.301 130
## 7 LaLiga 0.300 138
## 8 Liga NOS 0.263 111
## 9 Super League 0.233 106
## 10 Premier Liga 0.218 92
The most surprising thing here was to see that the Turkish Süper Lig is where the most efficient strikers play! This season, strikers there average 0.37 goals for every 90 minutes played. (Note: I multiplied goalsperminute by 90 so that we can talk about goals per game, which is more intuitive. This should not be confused with our actual “goalspergame” column.)
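The scaling and grouping behind such a table can be sketched as follows. The numbers and league labels are hypothetical and chosen only so the arithmetic is easy to follow; the column name `goals_per_90` is also an assumption.

```r
# Hypothetical per-player values; not taken from the real data set
df <- data.frame(
  current.league = c("League X", "League X", "League Y", "League Y"),
  goals          = c(10, 5, 4, 2),
  minutes        = c(1800, 900, 1800, 900)
)

# Goals per minute scaled to one full 90-minute game
df$goals_per_90 <- df$goals / df$minutes * 90

# Mean efficiency per league, sorted from most to least efficient
league_mean <- aggregate(goals_per_90 ~ current.league, data = df, FUN = mean)
league_mean[order(-league_mean$goals_per_90), ]
```

Read in the other direction, 0.366 goals per 90 minutes corresponds to roughly 0.004 goals per minute, which is in line with the overall mean of 0.0036 in the goals-per-minute summary.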
## # A tibble: 6 x 4
## # Groups: name [6]
## name current.club goals_per_minute total_minutes
## <fct> <fct> <dbl> <int>
## 1 Bafétimbi Gomis Galatasaray SK 1.06 2369
## 2 Jonas SL Benfica 1.06 2881
## 3 Ciro Immobile SS Lazio 1.06 2882
## 4 Burak Yilmaz Trabzonspor 1.05 1536
## 5 Cristiano Ronaldo Real Madrid 1.03 2877
## 6 Rangelo Janga KAA Gent 1.01 1787
Here it is interesting to see the top 3. (Note: I filtered the results to players who played at least 900 minutes this season, i.e. the equivalent of 10 full games. The values appear to be scaled to 90 minutes, as above.)
## # A tibble: 6 x 5
## # Groups: current.club [6]
## current.club current.league Mean total_minutes n
## <fct> <fct> <dbl> <int> <int>
## 1 UC Sampdoria Serie A 0.816 4323 3
## 2 Vitória Setúbal FC Liga NOS 0.729 3044 4
## 3 Paris Saint-Germain Ligue 1 0.658 13046 5
## 4 Waasland-Beveren Jupiler Pro League 0.647 9070 8
## 5 Trabzonspor Süper Lig 0.615 5385 5
## 6 Tottenham Hotspur Premier League 0.562 8137 5
Here we see that there is no direct bias from league, club or player in terms of effectiveness. Since the number of teams and players changes in each grouping, we see different
Our primary feature of interest was the total number of goals scored by strikers. We observed that there is a roughly linear relationship between minutes played per season and goals scored. We also observed that this relationship becomes exponential as the minutes exceed 3250 (about 36 full games). This is because players who play this many games are exceptionally good players with high consistency.
Another strong factor that affected the total number of goals scored per season was the position. In this data only two positions are given: CF and W. I observed that CF has a higher mean than W.
Yes. I observed that wingers (W) tend to make more assists per season than forwards (CF).
I also observed the following relationships:
The number of goals per season is highly correlated with position and total minutes on pitch.
Note how CFs paint the upper part of the plot. It is very obvious that their goal efficiency is higher.
When we break it down by league, we see more or less the same behaviour. No league stands out visually.
When we break it down by season, again nothing stands out visually, except the 2017-18 season. This is normal, as it was not finished yet.
The separation of positions is even clearer when we look at games played vs. goals scored.
Breaking it down by league again does not reveal a lot. The Russian Premier Liga and the Greek Super League are a bit behind the others in terms of maximum goals scored. They evidently have less efficient strikers.
Now, let’s look at the assists the same way we did for goals:
The most important observation is that assists are less separated by position.
Serie A is a bit different here. I will add a more detailed explanation of this in the reflection part.
Age does not seem to be correlated in any visible sense with goals scored.
The most interesting strengthening feature was the position. It was clearly observed that central forwards (CF) are more involved in goals scored than wingers (W).
The effect of position on assists made was a bit in favor of wingers, as expected, but the differentiation was not as clear as the effect of position on goals scored.
It was also an interesting observation that in Serie A (Italy) center forwards are much less involved in assists than in other European leagues.
N/A
Our main feature of exploration was the number of goals scored per season. This histogram shows the most general characteristics of strikers’ goal performance in football. For European leagues (and I would strongly expect other continents to be very similar) we found that a striker scores on average 6.7 goals per season. Considering that the 3rd quartile is 9 goals, we can assume a striker is a decent one if he scores > 9 goals per season.
With the information provided in the dataset, it turned out that a strong feature affecting total goals per season was the total minutes a player stayed on the pitch. The scatter plot above shows that up to 3000 minutes of game time, a striker’s total goals per season are linearly proportional to the minutes played.
However, a better fit reveals another interesting finding beyond 3000 minutes (approx. 33 full games). The goal scoring of players who play this much per season increases exponentially. As I stated above, this is an interesting finding for me, but nonetheless explainable: only very rare and skilled players play that many games per season, and not only the minutes they take but also their exceptional skills become more distinguishable.
After the hint given in the scatter matrix, it was obvious that position was very relevant to the total number of goals scored per season. Therefore, it was necessary to further investigate the scatter plot of goals per minute by position. Color coding by position revealed even more interesting and clarifying results. We saw that center forwards make up the upper part of the scatter plot, with a clear difference.
As a football fan and enthusiastic data scientist “candidate”, it was very entertaining and educational to go through this EDA. While searching for a data set of my own, I encountered this one; it was a good choice in terms of structure but still needed to be modified a little in order to work smoothly with RStudio. Thanks to the Python (pandas) skills we learned previously, it was easy to shape the data to be explored.
I really did not know what to expect from this data. I was not even sure how correctly it was collected. The first disappointing thing was that the Bundesliga (German league) was not included. I guess their statistics are subject to copyright etc. The second suspicion I still have is the relatively high number of 0 goals and assists for the oldest seasons. I either filtered these values or used the most recent season, depending on the content of the analysis.
Other than that, every step was either jaw-dropping (at least for me :) ) or a mathematical confirmation of the intuitive expectations of a 36-year-old football fan like myself. The following are my highlights:
10 goals per season is really a decent amount, and this was exactly what I found out!
To see Messi and Ronaldo as outliers in a boxplot was again a confirmation of how exceptional these players are.
It was really, really interesting to find out that our good old and not very popular Turkish league was in fact the league where the most efficient strikers played.
To find out the effect of position in the multivariate analysis was somewhat expected, but nevertheless beautiful to see.
And about the title picture of this EDA: the picture shows Bafétimbi Gomis, center forward of Galatasaray, who turned out to be the most efficient striker in Europe according to the goals per minute analysis I did.
The final highlight is somewhat technical and probably too much detail for a decent football fan (and I apologize to the evaluator of this EDA for my tactical passion), but after so many hours of coding and data exploration I have to point out that it was also interesting to see how far Serie A forwards were behind other leagues in terms of assists. The unique and traditional characteristics of Italian football have long been reluctant to adapt to modern tactics, and this can be seen not only in their absence from the top of European football compared to the 90s, but also in the very detail of my facet_wrap by current.league scatter diagram. The data speaks for itself, and we would not be wrong to deduce that, in the era of hard-working and physically top-level central forwards, Italy still has “poachers” whose sole aim is to be in the right place at the right time.
I think that including only wingers and central forwards lacks a bit of completeness. It could be logical to include attacking midfielders too, but then I think it would be harder to see such a nice scatter graph when we color by position. For future analysis, it may be interesting to look at assists vs. goals and define a more elaborate metric of a so-called “complete” striker, whose contribution to his team is even more than just scoring goals.